Scatter Plot of TFR Estimates

Introduction

In this short workbook, we outline how using R we can visualise and compare the various computations of the Palau TFR overtime. The data source is the World Population Prospects, United Nations (2022). What we aim to do is compare the different estimates with the aggregate WPP estimates.

Code

Below is the code used to build the scatter plot comparison.

Loading the Packages

First we load the tidyverse() library of R packages (Wickham et al. 2019), which includes the ggplot2() package (Wickham 2016), the dplyr() package (Wickham et al. 2022), and the plotly() package (Sievert 2020). We also use the ggthemes package for cool styling later (Arnold 2021).

library(tidyverse)
library(ggthemes)
library(plotly)

Loading the data

Next we load our two .csv data into R as two tibble objects. One, “e_tfr” is a compilation of all the different estimates for TFR by data source and method. The tibble has four columns: “data_souce, estimate_method, estimated_tfr, estimated_year”. The other, “wpp_tfr” is the WPP approximations for TFR 1960-2020 United Nations (2022)

# load the estimated tfr
e_tfr = read_csv("data/tfr_estimates.csv") %>%
  as_tibble() 

# load the approx tfr 
wpp_tfr = e_tfr %>% filter(data_source == "WPP")

Static Plots

Now, we use ggplot2() to compute the first static data visualisation. This shows all the different estimated TFR from the ‘e_tfr’ tibble.

# create the scatter plot
est = ggplot(e_tfr) +
  geom_line(aes(x = estimated_year,
                y = estimated_tfr,
                color = interaction(estimate_method, data_source))) +
  labs(x = "Estimated Year",
       y = "Estimated TFR",
       title = "Plot of TFR Estimates of Palau 1950-2020",
       color = "") +
  theme_solarized_2() +
  theme(legend.position = "bottom",
        legend.box = "vertical",
        legend.margin = margin()) +
  guides(color = guide_legend(nrow = 18, byrow = TRUE))

est

The next visualisation shows the estimates with just the WPP approximations

# create the scatter plot
wpp = ggplot(wpp_tfr, aes(x = estimated_year,
                          y = estimated_tfr,
                          color = "black"))+
  geom_line() +
  labs(x = "Estimated Year",
       y = "Estimated TFR",
       title = "WPP TFR Estimates for Palau 1950-2020",
       color = "") +
  theme_solarized_2() +
  theme(legend.position = "none")
wpp

Make interactive

Using the ggplotly function from the plotly package, we can make the plot interactive so it is possible to select which data you want to see and zoom + export a custom png. This is very challenging however and could also probably be achieved in Shiny with more ease (though I would have to host a backend which can be $$$)

est_plotly = ggplotly(est, width = 800, height = 2000)


# Create a list of buttons
buttons = list()

# Access the levels of the color scale for the correct order
unique_combinations = unique(e_tfr[c("estimate_method", "data_source")])
unique_combinations = unique_combinations[order(unique_combinations$data_source,
                                                unique_combinations$estimate_method), ]
unique_combinations = apply(unique_combinations, 1,
                            function(row) paste(row, collapse = "-"))

# Create a button for each level in the correct order
for (i in seq_along(unique_combinations)) {
  visibility_vals = rep(FALSE, length(unique_combinations))
  visibility_vals[i] = TRUE
  buttons[[length(buttons) + 1]] = list(
    method = "restyle",
    args = list("visible", visibility_vals),
    label = unique_combinations[i]
  )
}

# add a button to show all interactions
buttons[[length(buttons) + 1]] = list(
  method = "restyle",
  args = list("visible", rep(TRUE, length(unique_combinations))),
  label = "Show All"
)

# adjust layout
est_plotly = est_plotly %>% 
  layout(
    showlegend = FALSE,
    updatemenus = list(
      list(
        type = "buttons",
        direction = "down",
        y = -0.1,
        x = 0,
        xanchor = 'left',
        yanchor = 'top',
        buttons = buttons,
        tickfont = list(size = 5),    # adjust button font size
        pad = list(r = 10, t = 10),   # adjust padding
        showactive = TRUE
      )
    ),
    margin = list(
    b = 800  # adjust bottom margin to make space for buttons
    )
  )

# display the plot
est_plotly

Difference WPP and Estimates

Now I want to compute and visualise the difference betweent the WPP line and the different estimates

# drop WPP from data
calc_tfr = e_tfr %>% filter(data_source != "WPP")

# change cols in wpp
wpp_tfr = wpp_tfr %>% select(estimated_year, estimated_tfr) %>%
  rename(estimated_year = estimated_year, wpp_tfr = estimated_tfr)

# calc difference
calc_tfr = calc_tfr %>%
  left_join(wpp_tfr, by = "estimated_year") %>%
  mutate(difference = estimated_tfr - wpp_tfr)

We can visualise this data per year of estimate as a simple point and line of best fit graph

difference_model = lm(difference ~ estimated_year, data = calc_tfr)
r_squared = summary(difference_model)$r.squared

ggplot(calc_tfr, aes(x = estimated_year,
                     y = difference,
                     color = estimate_method,
                     shape = factor(data_source),
                     group = 1)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "red",
              alpha = 0.3) +
  scale_shape_manual(values = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) +
  labs(title = "Scatter Plot with Line of Best Fit",
       color = "Estimate Method",
       shape = "Data Source") +
  xlab("Year of Estimate") +
  ylab("Difference better Estimated TFR and WPP estimate") +
  geom_text(aes(x = max(estimated_year), y = min(difference), 
                label = paste("R\u00B2 =", round(r_squared, 2))),
            hjust = 1, vjust = -1, color = "dark red") +
  geom_text(aes(x = max(estimated_year), y = min(difference) + 0.1, 
                label = paste("Regression Formula: ",
                              round(coefficients(difference_model)[1], 2),
                              round(coefficients(difference_model)[2], 2),
                               "x")),
            hjust = 1, vjust = -1.5, color = "dark red") +
  theme_solarized_2()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: shape
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

Bibliography

Arnold, Jeffrey B. 2021. “Ggthemes: Extra Themes, Scales and Geoms for ’Ggplot2’.” https://CRAN.R-project.org/package=ggthemes.
Sievert, Carson. 2020. “Interactive Web-Based Data Visualization with r, Plotly, and Shiny.” https://plotly-r.com.
United Nations. 2022. “World Population Prospects - Population Division - United Nations.” https://population.un.org/wpp/.
Wickham, Hadley. 2016. “Ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse 4: 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. “Dplyr: A Grammar of Data Manipulation.” https://CRAN.R-project.org/package=dplyr.